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ABSTRACT 

Currently there is considerable interest in making use of 
many-core processor architectures, such as Nvidia and 
AMD graphics processing units (GPUs) for scientific com- 
puting. In this work we explore the use of the Open 
Computing Language (OpenCL) for a typical Numerical 
Relativity application: a time-domain Teukolsky equa- 
tion solver (a linear, hyperbolic, partial differential equa- 
tion solver using finite-differencing). OpenCL is the 
only vendor-agnostic and multi-platform parallel comput- 
ing framework that has been adopted by all major pro- 
cessor vendors. Therefore, it allows us to write portable 
source-code and run it on a wide variety of compute hard- 
ware and perform meaningful comparisons. The outcome 
of our experimentation suggests that it is relatively straight- 
forward to obtain order-of-magnitude gains in overall ap- 
plication performance by making use of many-core GPUs 
over multi-core CPUs and this fact is largely independent 
of the specific hardware architecture and vendor We also 
observe that a single high-end GPU can match the perfor- 
mance of a small-sized, message-passing based CPU clus- 
ter. 
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1 Introduction 

Computational scientists and engineers have begun making 
use of many-core GPU architectures because these can pro- 
vide significant gains in the overall performance of many 
numerical simulations at a relatively low cost. However, 
to the average computational scientist, these GPUs usu- 
ally employ a rather unfamiliar and specialized program- 
ming model that often requires advanced knowledge of 
their architecture. In addition, these typically have their 
own vendor- and platform- specific software development 
frameworks (SDKs), that are different from the others in 
significant ways. For example: Nvidia's GPUs use CUDA 
SDK [IJ, AMD's GPUs use Stream SDK 0, the STI Cell 
BE uses IBM Cell SDK \?\, while traditional multi-core 
processors (from Intel, AMD, IBM) typically employ an 
OpenMP-based parallel programming model lH. There- 
fore, the average computational scientist, with limited time 
and resources available to spend on deeply specialized soft- 



ware engineering aspects of these modern processors, is 
simply unable to embrace and make effective use of these 
current technologies for advancing science. 

In 2009, an open standard was proposed by Apple 
to bring the software development for all these different 
processor architectures under a single standard - the Open 
Computing Language (OpenCL) IJ) - and all major multi- 
core processor and GPU vendors (Nvidia, AMD, IBM, In- 
tel) have adopted this standard for their current and future 
hardware. 

The scientific application that we concentrate on in 
this work is an application from our Numerical Relativ- 
ity (NR) community - a time-domain Teukolsky equation 
solver, which is an explicit, finite-difference, linear, hy- 
perbolic (2h-1)D PDE solver that uses the 2nd-order Lax- 
Wendroff numerical evolution scheme ||6l |7l H) . Its worth 
noting that such a solver is in fairly common use, not only 
in our NR community but also in various other fields of 
science and engineering (for example, in Engineering Elec- 
tromagnetics, Optics, Acoustics, Fluid Dynamics and other 
applications of the wave equation); therefore we expect that 
our work would be of strong interest to the larger commu- 
nity of computational scientists. It is also worth pointing 
out that while some NR works ll9][ini[IiJ in the recent past 
have investigated the use of GPUs for finite-difference PDE 
solutions, these have mainly concentrated on making use of 
vendor supplied frameworks and libraries (such as Nvidia's 
CUDA) and achieving maximal performance. In our work 
here, we concentrate entirely on the OpenCL framework 
and pay serious attention to code portability in addition 
to performance. In other words, we run identical OpenCL 
source-code on all the hardware we consider in this work 
(both CPUs and GPUs). This work is amongst the firs{3 de- 
tailed evaluations of OpenCL for scientific computing on a 
wide variety of CPUs and GPUs. Lastly, all the computa- 
tions performed here are in double-precision floating-point 



' It is worth clarifying that earlier this year, we published a research 
article 1121 that used OpenCL to parallelize our EMRI Teukolsky Code 
on an Nvidia Tesla GPU and the Cell BE. However, that particular code 
is critically different from the one under consideration in the sense that 
it additionally has a very mathematically complex source-term i.e. it is 
an inhomogeneous hyperbolic PDE solver; and it is only the source-term 
computation kernel that is parallelized using OpenCL in that work. In our 
present work, we treat the significantly more common, homogeneous PDE 
i.e. the no source-term case, and parallelize all the compute kernels using 
OpenCL. 



accuracy. 

This article is organized as follows: In Section 2, 
we provide a very brief introduction to the OpenCL par- 
allel programming framework. In Section 3, we introduce 
our Teukolsky code, the relevant background gravitational 
physics and the numerical method used by the code. Next, 
we emphasize aspects of OpenCL and the compute hard- 
ware relevant to our implementation in Section 4. In Sec- 
tions 5 & 6, we present the overall performance results 
from our OpenCL-based parallelization efforts. Finally in 
Section 7, we summarize our work and make some conclu- 
sive remarks. 



2 OpenCL 

As mentioned already, OpenCL is a new framework for 
programming across a wide variety of computer hard- 
ware architectures (CPU, GPU, Cell BE, etc). In essence, 
OpenCL incorporates the changes necessary to the pro- 
gramming language C, that allow for parallel computing 
on all these different processor architectures. In addition, 
it establishes numerical precision requirements to provide 
mathematical consistency across the different hardware and 
vendors - a matter that is of significant importance to the 
scientific computing community. Computational scientists 
would need to rewrite the performance intensive routines 
in their codes as OpenCL kernels that would be executed 
on the compute hardware. The OpenCL API provides the 
programmer various functions from locating the OpenCL 
enabled hardware on a system to compiling, submitting, 
queuing and synchronizing the compute kernels on the 
hardware. Finally, it is the OpenCL runtime that actually 
executes the kernels and manages the needed data trans- 
fers in an efficient manner As mentioned already, most 
vendors have released an OpenCL implementation for their 
own hardware. As of the writing of the document, AMD 
and Nvidia have OpenCL freely available for their GPUs. 
IBM has also beta released OpenCL for the Cell BE and 
their Power line of multi-core processors. 

OpenCL is of tremendous value to the scientific com- 
munity because it is open, royalty-free and vendor- and 
platform- neutral. It delivers a high degree of portabil- 
ity across all major forms of current and future compute 
hardware. OpenCL allows us to run identical source-code 
on current (and future) multi-core CPUs, many-core GPUs 
and even hybrid processors such as the Cell BE and AMD 
Fusion, taking advantage of the different flavor of paral- 
lelism that they offer. 



3 Numerical Relativity 

In multiple earlier works |l6l|2l[8l our time-domain Teukol- 
sky equation solver is described in detail. Therefore, we 
simply reproduce some of that content below for complete- 
ness with minimal alterations. 



Several gravitational wave observatories fT3l are cur- 
rently being built all over the world: LIGO in the United 
States, GEO/Virgo in Europe and TAMA in Japan. These 
observatories will open a new window onto the Universe 
by enabling scientists to make astronomical observations 
using a completely new medium - gravitational waves 
(GWs), as opposed to electromagnetic waves (light). These 
GWs were predicted by Einstein's Relativity theory, but 
have not been directly observed because the required exper- 
imental sensitivity was simply not advanced enough, until 
very recently. 

Numerical Relativity is an area of computational 
science that emphasizes the detailed modeling of strong 
sources of GWs - collisions of compact astrophysical ob- 
jects, such as neutron stars and black holes. Thus, NR 
plays an extremely important role in the area of GW as- 
tronomy and gravitational physics, in general. Moreover, 
the NR community has also contributed to the broader 
computational science community by developing an open- 
source, modular, parallel computing infrastructure called 
Cactus |14|. For the purposes of GW data analysis (de- 
tection and parameter estimation), it is critical to have a 
highly-accurate template bank of theoretical waveforms. 
Because of the degree of accuracy necessary and the large 
number of templates required, it is important to develop ef- 
ficient computational methods for generating these theoret- 
ical waveforms. This motivates us to explore parallel com- 
puting frameworks like OpenCL and cutting-edge compute 
hardware like GPUs for NR. 

The specific NR application we have chosen for con- 
sideration in this work is one that evolves the perturbations 
of a rotating (Kerr) black hole i.e. solves the Teukolsky 
equation in the time-domain E |7] [S). This equation is 
essentially a linear wave-equation in Kerr space-time ge- 
ometry. The next two subsections provide more detailed 
information on this equation and the associated numerical 
solver code. 

3.1 Teukolsky Equation 

The Teukolsky master equation describes scalar, vector and 
tensor field perturbations in the space-time of Kerr black 
holes |15|. In Boyer-Lindquist coordinates, this equation 
takes the form 
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where M is the mass of the black hole, a its angular mo- 
mentum per unit mass, A = — 2Mr + and s is the 



"spin weight" of the field. The s = versions of these 
equations describe the radiative degrees of freedom of a 
simple scalar field, and are the equations of interest in this 
work. As mentioned previously, this equation is an ex- 
ample of linear, hyperbolic, homogeneous (3+l)D PDEs 
which are quite common in several areas of science and en- 
gineering, and can be solved numerically using a variety of 
finite-difference schemes. 

3.2 Teukolsky Code 

To solve Eq. ([T]) numerically in time-domain we take the 
approach first introduced by Krivan et al. in Ref. 0. First, 
we make use of Kerr spacetime's axisymmetry and factor 
out the 0-dependence of the Eq. ^ by decomposing the 
solution vj/ into azimuthal m-modes 



(2) 



In this manner the Eq. ([T]l is reduced to a linear system 
of decoupled (2H-l)-dimensional hyperbolic PDEs. Then, 
we rewrite this system in first-order form, by introducing a 
new auxiliary "momentum" field variable, 11. And finally, 
we develop an explicit time-evolution numerical scheme 
for this first-order, linear PDE system using the well- 
known two-step, 2nd-order Lax-Wendroff, finite-difference 
method. Explicit details on this approach can be found in 
Ref. O. 

Each iteration to evolve the system above consists 
of two steps: In the first step, the solution vector u = 
{$ R, $/, Ilfl, 11/} between grid points is obtained from 
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This is used to compute the solution vector at the next time 
step. 
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The angular subscripts are dropped in the above equation 
for clarity. All angular derivatives are computed using 2nd- 
order, centered finite difference expressions. Explicit forms 
for the matrices D and S can be easily found in the relevant 
literature |6|. 

Symmetries of the spheroidal harmonics are used to 
determine the angular boundary conditions: For even \m\ 
modes, we have de^ — at — 0, tt while $ = at 
6* = 0, TT for modes of odd |m|. We set $ and 11 to zero on 
the inner and outer radial boundaries. 



4 OpenCL Implementation 

The first task in our work is to isolate the most compute 
intensive portions of our Teukolsky equation solver code. 
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Upon performing a basic profiling of our code using the 
GNU profiler gprof, we learn that computing the "right- 
hand-sides" of the Lax-Wendroff steps i.e. the quantities 
within the square-brackets of Eqs. ^ and (HI, take nearly 
75% of the application's overall runtime. We anticipate 
that this observation is fairly typical for codes of this type. 
Thus, it is natural to consider accelerating this "right-hand- 
side" computation using data-parallelization on the many 
cores of the GPU. 

A data-parallel model is relatively straightforward to 
implement in a code like ours. We simply perform a 
domain decomposition of our finite-difference numerical 
grid and allocate the different parts of the grid to different 
cores. More specifically, on the GPU, each thread com- 
putes the right-hand-side for a single pair of r and 6 grid 
values. In addition, it is necessary to establish the appro- 
priate data communication between the GPU cores and the 
remaining code that is executing on the CPU - we use 
clEnqueueReadBuffer, clEnqueueWriteBuffer instruc- 
tions to transfer data back-and-forth from main memory 
and we only use global memory on the GPU to simplify 
communication between the GPU cores. We make this 
simplification (only making use of global memory) for the 
stated goal of keeping the code's portability intact, even if 
it impacts performance to some extent]!. 

Unfortunately, this naive approach yields a negligi- 
ble performance gain on the GPU. The reason is that al- 
though the right-hand-side computation is accelerated due 
to the use of the many-cores of the GPU, the time it takes 
to bring that data back-and-forth from main memory so that 
the remaining computation can resume on the CPU, is large 
enough that no overall gain in performance is perceived! 
This outcome is simply due to the poor bandwidth of the 
system's PCI bus on which the GPU is located. To ad- 
dress this issue, we port all the Lax-Wendroff related com- 
pute routines (such as the computation of the evolved fields 
half-way between grid points, the boundary condition im- 
position, updating of the fields using the right-hand-side 
data) as separate kernels onto the GPU. In this manner, no 
communication is necessary with the rest of the computer 
system and we overcome the challenge mentioned above. It 
is worth noting that some of these routines are perhaps not 
ideal for execution on the GPU (for example, some don't 
quite have a high enough arithmetic intensity that is essen- 
tial to obtain high performance from the GPU architecture) 



^Here, we very briefly document our experiences with two GPU- 
specific optimization teclmiques. Besides global memory, there are many 
other kinds of memory available on a GPU. Global memory is lai'ge, but is 
rather slow to access. On the other hand, shared memory is small, but very 
fast to access. If there is a lot of reuse of a particular data, it makes sense 
to store it in shared memory, thus enabling much faster access to this data. 
In our OpenCL Teukolsky code, there is not enough reuse of any memory 
data. Hence, shared memory optimizations yield only few percent of per- 
formance improvement. Another optimization technique that is frequently 
used in GPUs is memory coalescing. To achieve the peak memory band- 
width, global memory accesses needs to be coalesced. This means that 
memory locations accessed by the kernels need to be from contiguous 
memory locations. We note that memory accesses in our OpenCL code 
are fairly regular and therefore memory accesses are already coalesced. 



but we still port these over for execution on the GPU re- 
gardless, simply because our goal is to minimize data trans- 
fer back-and-forth from main memory. This requires a sig- 
nificant amount of additional effort - but one that pays off 
well eventually (as seen in the following section). 

The main limitation that we introduce with this ap- 
proach of running the entire computation on the GPU is 
that we need to able to fit the entire memory requirements 
of the code within the GPU video memory. Given that cur- 
rent high-end GPU offerings support only a few GBs of 
memory, this can be challenging for most NR codes, es- 
pecially those in (3h-1)D. However, a compute cluster with 
multiple GPUs per node, could perhaps overcome this seri- 
ous limitation. Most NR codes are message-passing (MPI) 
parallelized for cluster execution already using domain- 
decomposition. It should not be too difficult to extend that 
approach to a cluster with GPUs as the main compute de- 
vices. For the performance tests in our current work, we 
simply use grid sizes that maximize the use of video mem- 
ory on the (single) GPUs we consider 

Below, we depict a sample kernel from the OpenCL 
code. This kernel computes the evolved fields half-way be- 
tween the grid points using a simple averaging process. The 
array variables are defined as follows: qre and qim are the 
real and imaginary parts of the solution while pre and 
pim are the real and imaginary parts of the "momentum" 
n. The integers a and b label the array indices that are 
relevant to the r and 6 grid values involved in the kernel. 

#pragma OPENCL EXTENSION cl_khr_fp64 : enable 
tdefine idx(b,a) (N*(b)+(a)) 



kernel void kernel_averagel ( 

global double* qre_h, global double* qim_h, 

global double* pre_h, global double* pim_h, 

global double* qre, global double* qim, 

global double* pre, global double* pirn) 

{ 

int b; int a; 

b = get_global_id ( ) ; 
a = get_global_id ( 1 ) ; 

if (a >= 1 && a < N) 
{ 
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In summary, it is worth pointing out that this OpenCL 
implementation of our Teukolsky solver code is fairly 
straightforward. It should also be mentioned that we do 
not make a serious attempt to hand-tune the codes to tai- 
lor them for each architecture, in order to obtain maximal 
performance. As stated earlier, one of our goals is to keep 
the code highly portable, because we aim to run the exact 
same code on both GPUs and CPUs. The only variable 
that we tune (through simple experimentation) in order to 
obtain maximum performance for each architecture is the 
locaLworkjsize. 

5 Performance Results 

In this section, we report on the final results from our 
OpenCL implementation as outlined in the previous sec- 
tion. We begin by documenting the details of the com- 
pute hardware we use to perform the OpenCL code test- 
ing. We use three types of CPUs: Intel "Harpertown" Xeon 
(while this CPU is a bit dated, it is still one of the com- 
mon ones found in compute clusters at the present time, 
and therefore is still of some interest); Intel "Nehalem" 
Xeon (the most current Intel offering); IBM Power? (the 
most current processor from IBM's Power line - one that 
will be used in NSF's Blue Waters petascale system in the 
very near future, and therefore of particular interest). We 
use two types of high-end GPUs: Nvidia's "Fermi" C2050 
(most current CUDA Tesla GPU offering from Nvidia with 
3GB memory); AMD "Radeon" 5870 (most current Stream 
GPU offering from AMD with 1GB memory). In addition, 
we also include results from a traditional, message-passing 
(MPI) based, 10-node compute cluster built using 80 In- 
tel "Harpertown" Xeon cores and high-speed Infiniband 
DDR interconnects. Detailed specifications (GHz, cores, 
etc.) of these processors are available in Table L The 
systems supporting this compute hardware are either run- 
ning Mac OS X (the Nehalem and Harpertown CPUs) or a 
Linux distribution (everything else) as the primary operat- 
ing system. Finally, the most current OpenCL distribution 
available publicly at the time of performing this work, is 
installed on these systems. 

The last column of Table 1 depicts our final perfor- 
mance numberfl It is clearly evident that the GPUs yield 
an order-of-magnitude gain in overall performance over 
multi-core CPUs (even in a dual i.e. 8-core configura- 
tion). Recall, that the entire computation performed here is 
in the context of the double-precision floating-point accu- 
racy. What is also interesting is that this outcome is largely 
vendor and architecture independent. In other words, both 

^When compared with the serial (single-core) code, the OpenCL ver- 
sion yields a performance gain of a factor of 2.5 on 8 CPU cores. This 
modest scaling is fairly typical of NR codes like these. This is mainly 
due to the fact such codes are severely memory bandwidth limited and 
therefore, it is the slow access to system memory that strongly limits their 
overall performance. 



Table 1 . This table depicts the relative values for per- 
formance for several variants of current generation CPUs 
and CPUs. The baseline system here is an Intel "Harper- 
town" Xeon, 8-core, 2.8 GHz CPU running our OpenCL 
code. These values are based on the overall runtimes of 
our Teukolsky code on these different systems. 



Name 


lype 


GHz 


Cores 


Code 


Perf. 


Nehalem 


CPU 


2.3 


8 


OCL 


1.4x 


Power? 


CPU 


3.0 


8 


OCL 


1.4x 


Fermi 


GPU 


1.2 


448 


OCL 


9.9x 


Radeon 


GPU 


0.9 


1600 


OCL 


9.5x 


Cluster 


CPU 


2.5 


80 


MPI 


9.6x 



Nvidia and AMD GPUs provide near identical gains over 
both Intel and IBM CPUs. In fact, all the CPUs considered 
in our study performed comparably and so did the GPUs, 
regardless of vendor or architecture. Note that in particu- 
lar, Nehalem and Power? processor architectures are sig- 
nificantly different (for example, one is x86 while the other 
is PPC), but our OpenCL code yields near identical perfor- 
mance on both. 

What is also striking is that a (single) GPU performs 
comparable to a small-sized CPU cluster! In particular, we 
observe that a high-end Nvidia or AMD GPU can match 
the performance of an 80-core (10-node) Intel Xeon CPU 
cluster with high-speed interconnects. This suggests that 
a multi-GPU desktop system can potentially replace com- 
mon small/mid sized CPU clusters (upto several 100 cores); 
which would yield significant savings in physical space, 
procurement costs, power consumption and a major reduc- 
tion in failure-rate. 

Another observation worth making from our results 
above is that the GPUs also outperform the CPUs in metrics 
such as performance-per-dollar and performance-per-watt 
by an order-of-magnitude. In particular, between the Fermi 
and the Radeon, since they both exhibit the same level of 
performance, but the Radeon is over five (5) times lower in 
cost and also slightly more power efficient, it is the most 
cost-effective compute hardware amongst the ones we con- 
sider in this work. Although, it should also be noted that the 
Radeon only has 1GB of memory, compared to the Fermi 
that is equipped with with a substantially higher, 3GBs. 

Finally, it is worth emphasizing that we run identi- 
cal source-code on all the hardware we consider in this 
work. That is a highly non-trivial benefit offered by 
the OpenCL framework, promising tremendous savings in 
code-development efforts to the computational scientist. 
As mentioned earlier, the goal of our present work is to 
evaluate OpenCL not only from the viewpoint of perfor- 
mance, but also portability. The fact that we are able to run 
the same source-code on processor architectures as differ- 
ent as CPUs and GPUs, and yet achieve high performance, 
speaks very highly of this technology and its potential in 
scientific computing. 



6 Other Related Work 

Since OpenCL is a relatively new framework, currently 
there are not many published results based on it in the rel- 
evant literature. However, there are a few closely related 
works that we mention in this section. Karimi et al. liT6l 
perform a performance comparison of CUDA and OpenCL 
on Nvidia GPUs and find that CUDA is faster on the ker- 
nels they ran. Waage L1?J uses OpenCL to accelerate image 
filtering and obtains comparable results to a CUDA imple- 
mentation. Zhang et al. |T8l use OpenCL to accelerate CT 
reconstruction and image recognition. Brown et al. |fT9ll 
develop an OpenCL and CUDA implementation of molec- 
ular dynamics software LAMMPS and find that OpenCL is 
somewhat slower than CUDA. 



7 Conclusions 

The main goal of this work is to evaluate a new parallel 
code development framework, OpenCL, for scientific com- 
putation. OpenCL is hardware and platform neutral and yet 
able to deliver strong performance i.e. it delivers portabil- 
ity and high performance on all modern many-core GPUs 
and multi-core CPUs. This makes OpenCL potentially very 
attractive to the scientific computation community. 

In this work, we perform one of the first careful eval- 
uations of OpenCL for scientific computing on a wide va- 
riety of currently available CPUs and GPUs. In particular, 
we take a sample application from the Numerical Relativ- 
ity community - a time-domain, linear, finite-difference, 
hyperbolic PDE solver - and implement its entire compu- 
tation as parallel OpenCL kernels. We describe the par- 
allelization approach taken and also the relevant important 
aspects of the considered compute hardware in some detail. 

Our results suggest that it is relatively straightfor- 
ward to obtain order-of-magnitude gains in overall applica- 
tion performance by using current Nvidia or AMD GPUs 
over Intel or IBM CPUs. In addition, this outcome is 
largely vendor and architecture independent. Moreover, 
the OpenCL source-code is identical for all these CPUs and 
GPUs, which is a non-trivial benefit; it promises significant 
savings in parallel code-development and optimization ef- 
forts. 
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